The CP-PACS project is a five year plan, which formally started in April 1992 and was completed in 
March 1997, to develop a massively parallel computer for carrying out research in computational physics with 
primary emphasis on lattice QCD. The initial version of the CP-PACS computer, with a theoretical peak speed 
of 307 GFLOPS with 1024 processors, was completed in March 1996. The final version, with a peak speed of 614 
GFLOPS with 2048 processors, was completed in September 1996, and has been in full operation since October 
1996. We describe the architecture, the final specification, the hardware implementation, and the software of the 
CP-PACS computer. The CP-PACS has been used for hadron spectroscopy production runs since July 1996. The 
performance for lattice QCD applications and for the LINPACK benchmark is given. 

1. Introduction 

Numerical studies of lattice QCD have developed 
significantly during the past decade in parallel 
with the development of computers. Of particular 
importance in this regard has been the 
construction of dedicated QCD computers (for 
reviews see Ref. [1]) and the move of commercial 
vendors toward parallel computers in recent 
years. In Japan the first dedicated QCD computer 
was developed in the QCDPAX project. 
The QCDPAX computer, with a peak speed of 
14 GFLOPS, is actually the 5th computer in the 
PAX project, which pioneered the development 
of parallel computers for scientific and 
engineering applications in Japan. 

The CP-PACS project was conceived as a suc- 
cessor of the QCDPAX project in the early sum- 
mer of 1991. The project name CP-PACS is an 
acronym for Computational Physics by a Parallel 
Array Computer System. The aim of the project 
was to develop a massively parallel computer for 
carrying out research in computational physics 
with primary emphasis on lattice QCD. 

The CP-PACS project started in April 1992 and, 
after five years, is coming to a conclusion in 
March 1997. It is therefore timely to give an 
overview of the CP-PACS project and the CP-PACS 
computer at this workshop, held in the middle of 
March 1997. In this article we present an 
overview of the chronology and the organization 
of the CP-PACS project in Sec. 2, and describe the 
details of the CP-PACS computer, including the 
architecture, the final specification, the hardware 
implementation, and the software, in Sec. 3. 
Research areas covered by the CP-PACS 
project are given in Sec. 4. In Sec. 5 the 
performance of the computer for lattice QCD 
applications as well as for the LINPACK benchmark 
is given. Physics results obtained on the CP-PACS 
computer are presented in other contributions. 
Sec. 6 is devoted to conclusions. 

2. The CP-PACS Project 

The CP-PACS project aims at developing a 
massively parallel computer designed to achieve 
high performance for numerical research of the 
major problems of computational physics. It fur- 
ther aims at significant progress in the solution 
of these problems through the application of the 
computer upon completion of its development. 

The planning of the project was started in the 
summer of 1991. The proposal, made to the 
Ministry of Education, Science and Culture, was 
approved in the spring of 1992 as one of the projects of 
the Ministry's "Program for New Development 
of Academic Research". The project formally 
started in April of 1992, and has received about 
2.2 billion yen spread over the five year period 
ending in March 1997. The funding comes from a 
special allocation of the Grant-in-Aid of the 
Ministry of Education, Science and Culture 
supporting innovative fundamental research. 

Table 1 
CP-PACS Project members 

      computer science                    computational physics 
hardware       software        particle physics   astrophysics   condensed matter 
K. Nakazawa    I. Nakata       Y. Iwasaki         S. Miyama      S. Miyashita 
H. Nakamura    Y. Yamashita    A. Ukawa           T. Nakamura    M. Imada 
T. Boku        Y. Oyanagi      K. Kanaya          M. Umemura     K. Nemoto 
T. Hoshino     T. Kawai        S. Aoki            Y. Nakamoto    A. Oshiyama 
T. Shirakawa   M. Mori         T. Yoshie                         S. Gunji 
K. Wada        Y. Watase       M. Okawa 
M. Yasunaga    S. Ichii        N. Ishizuka 
S. Sakai                       M. Fukugita 
                               H. Kawai 

Affiliations: 
Department of Computer Science, University of Electro-Communications 
Center for Advanced Science and Technology, University of Tokyo 
Center for Computational Physics, University of Tsukuba 
Institute of Engineering Mechanics, University of Tsukuba 
Institute of Information Sciences and Electronics, University of Tsukuba 
Department of Information Science, University of Tokyo 
Department of Physics, Keio University 
Department of Engineering, University of Tokyo 
Data Handling Division, KEK 
Computer Center, University of Tokyo 
Institute of Physics, University of Tsukuba 
Numerical Theory Division, KEK 
Yukawa Institute for Theoretical Physics, Kyoto University 
Theory Division, KEK 
National Astronomical Observatory 
Department of Physics, Osaka University 
Institute of Solid State Physics, University of Tokyo 
Department of Physics, Hokkaido University 

The Center for Computational Physics was 
founded in April 1992 at the University of Tsukuba 
to carry out the project, as well as to promote 
research in computational physics and parallel 
computer science. The Center is an inter-university 
facility open to researchers in academic institu- 
tions in Japan. 

The number of project members, which was 
22 when the project started, has increased to 33, 
of which 15 are computer scientists and 18 are 
physicists, as listed in Table 1. The project 
was headed by Y. Iwasaki. The development of 
the CP-PACS computer was led by K. Nakazawa. 



A unique feature of the project, as is clear from 
Table 1, is its emphasis on cross-disciplinary 
research involving both physicists and computer 
scientists. This is a tradition carried over from 
the QCDPAX project, which is the predecessor 
of and stepping stone for the CP-PACS project. 
A close collaboration of researchers from the two 
disciplines has been both important and fruitful 
in reaching a design for the CP-PACS computer 
which best balances the computational needs of 
physics applications with the latest computer 
technologies. 

Development of a massively parallel computer 
requires advanced semiconductor technology. We 
discussed the aim of the project with a number of 
manufacturers and invited proposals in the period 
of 1991-1992. We selected Hitachi Ltd. as the 
industrial partner through a formal bidding 
process in the early summer of 1992, and we have 
worked in close collaboration on the hardware 
and software development of the CP-PACS 
computer. The fundamental design of the computer 
was laid down in 1992, its details were worked out in 
1993, the logical design and the physical 
packaging design were completed in 1994, and chip 
fabrication and assembling of parts started in early 
1995. The first stage of the CP-PACS computer, 
consisting of 1024 processing units with a peak 
speed of 307 GFLOPS, was completed in March 
1996. An upgrade to a 2048-processor system with a peak 
speed of 614 GFLOPS was completed at the end 
of September 1996. 



3. CP-PACS Computer 

3.1. Architecture 

The CP-PACS computer is an MIMD (Multiple 
Instruction-streams Multiple Data-streams) 
parallel computer with a theoretical peak speed of 
614 GFLOPS and a distributed memory of 128 
Gbyte. The system consists of 2048 processing 
units (PU's) for parallel floating point 
processing and 128 I/O units (IOU's) for distributed 
input/output processing. These units are 
connected in an 8 × 17 × 16 three-dimensional array by 
a three-dimensional crossbar network. The 
specification of the CP-PACS computer is summarized 
in Table 2. 
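
The headline figures above are mutually consistent, which can be checked with a few lines of arithmetic. The sketch below is only an illustration and is not part of the CP-PACS software; it assumes the per-PU peak speed of 300 MFLOPS and the 64 MByte of DRAM per PU quoted in other sections of this article, and all variable names are ours.

```python
# Consistency check of the CP-PACS headline specification.
# The per-PU figures (300 MFLOPS peak, 64 MByte DRAM) are taken from
# other sections of this article; all names are illustrative.
N_PU = 2048               # processing units
N_IOU = 128               # I/O units
PEAK_MFLOPS_PER_PU = 300  # peak speed of one PU
MEM_MBYTE_PER_PU = 64     # main memory of one PU

peak_gflops = N_PU * PEAK_MFLOPS_PER_PU / 1000   # total peak speed in GFLOPS
memory_gbyte = N_PU * MEM_MBYTE_PER_PU / 1024    # total memory in GByte
array_nodes = 8 * 17 * 16                        # nodes in the 3-d array

print(f"peak speed  : {peak_gflops} GFLOPS")
print(f"main memory : {memory_gbyte} GByte")
print(f"node array  : {array_nodes} = {N_PU} PU + {N_IOU} IOU")
```

The 8 × 17 × 16 array therefore holds exactly the 2048 PU's plus the 128 IOU's.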

The basic strategy we have adopted for the de- 
sign is the usage of a fast RISC micro-processor 
for high arithmetic performance at each node and 
a linking of nodes with a flexible network so as to 
be able to handle a wide variety of problems in 
computational physics. The unique features of 
the CP-PACS computer reflecting these goals are 
represented by the special node processor archi- 
tecture called pseudo vector processor based on 
slide-windowed registers (PVP-SW) and the 
choice of a three-dimensional Hyper Crossbar net- 
work. A well-balanced performance of CPU, net- 
work and I/O devices supports the high capability 
of CP-PACS for massively parallel processing. 



Table 2 
Specification of the CP-PACS computer 

peak speed              614 GFLOPS (64 bit data) 
main memory             128 GB 
parallel architecture   MIMD with distributed memory 
number of nodes         2048 
node processor          HP PA-RISC 1.1 + PVP-SW 
  #FP registers         128 
  clock cycle           150 MHz 
  1st level cache       16 KB (I) + 16 KB (D) 
  2nd level cache       512 KB (I) + 512 KB (D) 
network                 3-d crossbar 
  node array            8 × 17 × 16 * 
  through-put           300 MB/sec 
  latency               2.5 ~ 3.1 μsec 
distributed disks       3.5" RAID-5 disk 
  total capacity        1059 GB 
software 
  OS                    UNIX, micro kernel 
  language              FORTRAN, C, assembler 
size                    7.0 m (width) × 4.2 m (depth) × 2.0 m (height) 
power dissipation       275 kW maximum 

* including nodes for disk I/O 



3.2. Node Processor 

Each PU of the CP-PACS has a custom-made 
superscalar RISC processor with an architecture 
based on PA-RISC 1.1. In large scale computa- 
tions in scientific and engineering applications on 
a RISC processor, the performance degradation 
occurring when the data size exceeds the cache 
memory capacity is a serious problem. The PVP- 
SW is our solution to this problem, while main- 
taining upward compatibility with the PA-RISC 
architecture. 

A schematic illustration of the PVP-SW 
architecture is given in Fig. 1. The Slide Window 
mechanism allows the use of a large number 
of physical registers, 128 in the case of 
CP-PACS, through a logical register window 
of 32 registers that slides continuously along the 
physical registers. The Preload and Poststore 
instructions can be issued without waiting for the 
completion of memory access. These features 
enable pipelined access to main memory, which is 
built from multiple interleaved banks, so that a 
long latency for memory access can be tolerated. 
Efficient vector processing, without degradation 
even for very long vector loops, is thus realized 
in spite of the superscalar architecture of the 
CP-PACS processor. 

Figure 1. Structure of slide-windowed registers. 
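
The idea of a logical window sliding over a larger physical register file can be illustrated with a toy model. The sketch below is not the PVP-SW specification: it assumes a simple modular mapping from logical to physical register numbers and a hypothetical slide amount of 8 registers per loop iteration, neither of which is stated in this article. Only the sizes 128 and 32 come from the text.

```python
# Toy model of a sliding register window (illustrative only).
# Assumption: logical register r maps to physical register
# (base + r) mod N_PHYSICAL; the real PVP-SW mapping may differ.
N_PHYSICAL = 128   # physical floating point registers (from the text)
WINDOW = 32        # logical registers visible at one time (from the text)


def physical_index(base: int, logical: int) -> int:
    """Map a logical register number to a physical register number."""
    if not 0 <= logical < WINDOW:
        raise ValueError("logical register outside the window")
    return (base + logical) % N_PHYSICAL


def slide(base: int, step: int) -> int:
    """Advance the window base by `step` physical registers."""
    return (base + step) % N_PHYSICAL


base = 0
for iteration in range(4):
    first = physical_index(base, 0)
    last = physical_index(base, WINDOW - 1)
    print(f"iteration {iteration}: window covers physical {first}..{last}")
    base = slide(base, 8)   # hypothetical slide amount per loop iteration
```

In such a scheme a value preloaded into a register just ahead of the current window becomes addressable after the next slide, which is how a load can be issued several iterations before its result is used.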

3.3. Network 

The 2048 processors are arranged in a 
three-dimensional 8 × 16 × 16 array. The Hyper 
Crossbar network is made of crossbar switches in the 
x, y and z directions, connected together by an 
Exchanger at each of the three-dimensional 
crossing points of the crossbar array, as illustrated by 
the schematic diagram shown in Fig. 2. Each 
exchanger is connected to a PU or IOU. Thus any 
pattern of data transfer can be performed with 
the use of at most three crossbar switches. Since 
the network has a huge switching capacity due to 
the large number of crossbar switches, the 
sustained data transfer throughput in general 
applications is very high. 

Data transfer on the network is made through 
Remote DMA (Remote Direct Memory Access), 
in which processors exchange data directly 
between their respective user memories with a 
minimum of intervention from the operating system. 
This leads to a significant reduction in the startup 
latency, and a high throughput. 

Figure 2. Schematic diagram of CP-PACS. 

Inter-node communication is made by message 
passing. Transfer of data within the network 
proceeds via wormhole routing through the 
exchangers. The direction of routing is fixed to 
x → y → z to avoid deadlocks. The bandwidth 
of each crossbar is 300 Mbyte/sec. The latency, 
namely the initial overhead due to hardware and 
software combined, for sending and receiving data 
is 2.45, 2.83 and 3.09 μsec, respectively, for the 
cases of data transfer through one crossbar 
switch in the x direction, two switches in the x 
and y directions, and three switches in the x, y 
and z directions. 
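
The fixed x → y → z routing order can be sketched as follows. This is a minimal illustration rather than the actual routing logic of the exchangers: the functions `route` and `startup_latency` are hypothetical names of ours, and the sketch assumes that the startup latency depends only on the number of crossbar switches traversed, whereas the text quotes the three values for the specific cases x, x-y and x-y-z.

```python
# Sketch of dimension-ordered routing on a 3-d crossbar (illustrative).
# One crossbar switch is traversed for every coordinate that differs
# between source and destination, always in the order x, y, z.
LATENCY_USEC = {1: 2.45, 2: 2.83, 3: 3.09}  # values quoted in the text


def route(src, dst):
    """Return the list of (direction, node) hops from src to dst."""
    hops = []
    current = list(src)
    for axis, name in enumerate("xyz"):
        if current[axis] != dst[axis]:
            current[axis] = dst[axis]   # one crossbar reaches any node on the line
            hops.append((name, tuple(current)))
    return hops


def startup_latency(src, dst):
    """Latency in microseconds, assuming it depends only on the hop count."""
    n = len(route(src, dst))
    return LATENCY_USEC.get(n)   # None when src == dst (no network transfer)


path = route((0, 0, 0), (5, 9, 3))
print(path)
print(startup_latency((0, 0, 0), (5, 9, 3)))
```

Because every message visits the directions in the same order, no cyclic dependency between crossbars can arise, which is the usual argument for deadlock freedom in dimension-ordered routing.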

The network allows a hardware bisection of PU 
arrays in each of the x, y and z directions. Hence 
the full system can be divided into up to 8 
independent subsystems. 

3.4. Distributed Disks 

The distributed disk system of CP-PACS is 
connected by a SCSI-II bus to the 128 IOU's on 
the 8 × 16 plane at the end of the y direction of 
the Hyper Crossbar network. RAID-5 disks are used 
for fault tolerance. The IOU's handle parallel file 
I/O requests issued by the PU's in an efficient 
and distributed way using Remote DMA through 
the Hyper Crossbar network. 

Figure 3. Floor-plan of the CPU chip 

3.5. Connection to the Front End 

The HIPPI connection to the front host is 
attached to one of the IOU's. A special FTP 
protocol has been developed for high speed file 
transfer between the distributed disk system of 
CP-PACS and the disk storage of the front host. 
The peak throughput is 100 Mbytes/sec, and the 
effective throughput is about 65 Mbytes/sec 
when data with a size of 512 Mbyte 
are sent from CP-PACS to the front host or from 
the host to CP-PACS. 
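
As a rough illustration of what these numbers mean in practice, the time for the 512 Mbyte transfer and the fraction of the peak throughput actually reached follow directly from the figures above. The snippet is plain arithmetic on the quoted values, with variable names of our own choosing.

```python
# Transfer-time estimate for the HIPPI link (illustrative arithmetic).
PEAK_MBYTE_PER_SEC = 100       # peak throughput quoted in the text
EFFECTIVE_MBYTE_PER_SEC = 65   # effective throughput for a 512 Mbyte transfer
SIZE_MBYTE = 512

seconds = SIZE_MBYTE / EFFECTIVE_MBYTE_PER_SEC
efficiency = EFFECTIVE_MBYTE_PER_SEC / PEAK_MBYTE_PER_SEC
print(f"{SIZE_MBYTE} Mbyte in about {seconds:.1f} s "
      f"({efficiency:.0%} of the peak throughput)")
```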

3.6. Hardware Implementation 

The CPU chip is fabricated using 0.3 micron 
CMOS semiconductor technology, with a size 
of 15.7 mm × 15.7 mm. Fig. 3 shows the floor 
plan of the chip. The PVP-SW feature is 
implemented with 128 floating point registers, which 
occupy the top left block together with the floating 
point execution units. 

The CPU, the storage controller (SC) and the 
network interface adapter (NIA) are mounted in 
line on a ceramic multi-chip module of size 5.7 cm 
× 7.2 cm, which is shown in Fig. 4: the left chip is 
the CPU, the central one is the SC and the right 
one is the NIA. The SC and NIA chips are 
fabricated using 0.5 micron CMOS gate-array 
technology. The twelve pieces surrounding them are 
the second-level cache memory chips. 

Figure 4. Ceramic multichip module of CPU 

Figure 5. One board consists of eight CPU units 

Eight PU modules together with their DRAM 
memory are mounted on a board of size 45.6 cm 
× 62.5 cm, as shown in Fig. 5. The central piece 
of each of the eight sections is the PU 
module, now with fins for air-cooling. The other 
white pieces are main memory address/data 
control units. The black pieces are DIMM modules of 4 
Mbit DRAM, 64 MByte for each PU. Each board 
has two more chips for the crossbar switches in 
the x direction, and one chip for the clock 
distributor. 

Sixteen PU boards and one IOU board are 
placed vertically on a back plane, and two back 
planes, one on top of the other, are housed in a 
cabinet. A crossbar switch in the y direction is 
mounted on the back plane in the cabinet. 
Crossbar switches in the z direction are mounted on 
separate boards, which are housed in separate 
cabinets. A picture of the CP-PACS computer 
is shown in Fig. 6. 

Figure 6. Outlook of the CP-PACS computer 

A schematic floor plan of the cabinets is 
shown in Fig. 7. The squares labeled "PU" 
represent cabinets housing the PU's and IOU's, and 
those labeled "Z" are for the crossbar switches 
in the z direction. The cabinets labeled "IOA" 
contain I/O adapters. The RAID-5 distributed 
disk system, placed a few meters from the 
CP-PACS, is connected to the IOU's by a SCSI-II 
bus through adapters in the IOA cabinets. The 
system is cooled by air drawn in from beneath the 
cabinets. 

Figure 7. Floor-plan of the CP-PACS computer 
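
The packaging hierarchy described above can be tallied in a few lines. The board, back plane and cabinet counts per level are taken from the text; the derived number of PU cabinets and of IOU's per IOU board are our own inference from the totals and are not stated in the article.

```python
# Packaging arithmetic for the CP-PACS cabinets (illustrative).
PU_PER_BOARD = 8               # PU modules on one board (from the text)
PU_BOARDS_PER_BACKPLANE = 16   # PU boards on one back plane (from the text)
IOU_BOARDS_PER_BACKPLANE = 1   # IOU boards on one back plane (from the text)
BACKPLANES_PER_CABINET = 2     # back planes in one cabinet (from the text)
TOTAL_PU = 2048
TOTAL_IOU = 128

pu_per_backplane = PU_PER_BOARD * PU_BOARDS_PER_BACKPLANE
pu_per_cabinet = pu_per_backplane * BACKPLANES_PER_CABINET
pu_cabinets = TOTAL_PU // pu_per_cabinet                 # inferred, not quoted
backplanes = pu_cabinets * BACKPLANES_PER_CABINET        # inferred, not quoted
iou_per_board = TOTAL_IOU // (backplanes * IOU_BOARDS_PER_BACKPLANE)

print(f"PU per back plane : {pu_per_backplane}")
print(f"PU per cabinet    : {pu_per_cabinet}")
print(f"PU cabinets       : {pu_cabinets}")
print(f"IOU per IOU board : {iou_per_board}")
```

Under these assumptions the numbers line up with the 8 × 17 × 16 node array: 8 units per board along x, 16 + 1 boards per back plane along y, and 16 back planes along z.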

3.7. Software of the CP-PACS 

3.7.1. Operating System 

The CP-PACS computer runs under the UNIX 
OSF/1 operating system. Each node processor, 
however, carries only a kernel based on Mach 
3.0 in order to save memory for user application 
programs and to avoid performance degradation. 
The kernel handles memory control, inter-node 
communication, process scheduling, interrupt 
handling and I/O. The full UNIX interface 
and file server functions are implemented on the 
IOU's. One of the IOU's, named the SIOU, 
controls the whole system through the network. 

The operating system has several new functions 
added for parallel processing: software 
partitioning of the processor array, so that independent 
programs may be run on different partitions, and 
the generation of processes over a user-specified 
number of nodes to execute a parallel program. 

The file system is logically structured to form 
a single tree for the entire CP-PACS computer. 
The file sets required to execute a single job can 
be distributed over the disks connected to the 
parallel IOU's so as to reduce I/O overheads. The 
logical and physical mapping of the file system is 
automatically controlled by the operating system. 

3.7.2. Programming Environment 

FORTRAN90, C, C++ and assembly language 
are available for programming on the CP-PACS 
computer. Assembler code can be included as 
a subroutine in a FORTRAN or C code in or- 
der to maximize the performance. Remote DMA 
data transfer through the Hyper Crossbar net- 
work is made by calling special library routines 
for communications. FORTRAN90 and C com- 
pilers generate assembler codes which incorporate 
the PVP-SW enhancement, using the technique 
of modulo scheduling and register coloring. 

The Real-Time Performance Monitor allows an 
on-line check of the performance of the CP-PACS 
in applications. Various data, including the flops 
of each CPU and the busy rate of the network, 
can be collected at regular intervals and 
displayed graphically on terminals. 

Figure 8. The computing system at the Center 
for Computational Physics. 

3.8. Front End and Mass Storage 

The computing system at the Center for 
Computational Physics is shown in Fig. 8. The 
CP-PACS computer is connected by a HIPPI channel 
and Ethernet to the front host, which in turn is 
connected to the disk storage (350 GByte) and a 
tape archive (780 GByte). The front host is a vec- 
tor computer with a peak speed of 256 MFLOPS 
and 1 GByte of main memory. Job requests for 
the CP-PACS are submitted through the front 
host using NQS. Data I/O between the disk stor- 
age and the distributed disk system of the CP- 
PACS is made through the HIPPI channel. Out- 
put data files are sent back to the disk storage at 
the termination of each job request. 



The front host has a disk storage of 350 GByte 
connected by multiple channels to achieve a high 
data transfer throughput. The front host is also 
connected to a magnetic cartridge tape library 
which holds 980 cartridges, each with a capacity 
of 800 MByte. 

The Center computing facility includes a work- 
station cluster connected by a high speed switch. 
One of the workstations functions as a file server 
accessing a RAID-5 disk system with a total 
capacity of 89 GByte. 

The QCDPAX, which was developed at the 
University of Tsukuba by the QCDPAX project 
(1987-1990), is also a part of the system. It has been 
in continuous operation since its completion for the 
numerical simulation of lattice QCD. 

The computing facilities of the Center are con- 
nected by a LAN consisting of an FDDI loop and 
Ethernet, which in turn is connected to the Uni- 
versity of Tsukuba campus network. 

4. Research Areas in Computational Physics 

In computational physics the project aims to 
use the CP-PACS computer for carrying out 
research in the following three areas: particle 
physics, condensed matter physics and astro- 
physics. 

A major goal of the project is to significantly 
advance the numerical study of lattice QCD in 
particle physics. Large-scale numerical 
simulations will be pursued with the CP-PACS 
computer in order to verify the theory and to 
extract new physical predictions. Since the 
CP-PACS computer with 1024 nodes was completed, 
hadron spectroscopy calculations in the quenched 
approximation as well as in full QCD have been 
intensively performed, and physics results are 
reported at this workshop. 

Important problems in condensed matter 
physics, such as strongly interacting electron 
systems, high-temperature superconductivity and 
first-principles calculations of material properties, 
and those in astrophysics, such as the formation of 
galaxies and stellar/planetary systems and 
gravitational collapse, will also be pursued with 
the CP-PACS computer. Preparations for 
astrophysics and condensed matter physics 
applications have been started, and will gradually 
expand in time. 

Table 3 
Performance for lattice QCD programs 

program                                            MFLOPS/PU           coding 
                                                   (peak 300 MFLOPS) 
Red/black MR solver         calculation     77%    191                 assembler 
for Wilson quark matrix     communication   23%                        + Fortran 
                            sustained      100%    148 
                            calculation     84%     99                 Fortran 
                            communication   16% 
                            sustained      100%     84 
Conjugate gradient solver   calculation     90%    139                 Fortran 
for Kogut-Susskind quark    communication   10% 
matrix                      sustained      100%    125 
Heat bath Monte Carlo       calculation     96%    100                 Fortran 
program                     communication    4% 
for SU(3) gauge theory      sustained      100%     95 
Over-relaxation             calculation     91%    156                 Fortran 
program                     communication    9% 
for SU(3) gauge theory      sustained      100%    142 
Hybrid Monte Carlo          calculation     74%    151                 Fortran 
program                     communication   26% 
for full QCD                sustained      100%    112 

5. Performance 

Our codes for large-scale simulations in 
lattice QCD have been written in Fortran 90 
with libraries for data communication. The 
Fortran compiler, which has been newly developed 
to incorporate the PVP-SW feature, produces 
efficient object codes, achieving typically 90 - 150 
MFLOPS per node, depending on the structure 
of do-loops. We have further developed a 
hand-optimized assembler code for the core part of the 
red/black solver of the Wilson quark matrix. In 
this case the performance reaches 191 MFLOPS 
per node, which is about 64 % of the peak speed 
(see Table 3). Even when the overhead due to 
data communication is included, the sustained 
speed in this case is 148 MFLOPS, which is about 
half of the peak speed. The performance for 
typical application programs in lattice QCD is 
shown in Table 3. 

We have also measured the performance for the 
LINPACK benchmark. The results are 
summarized in Table 4. The sustained speed for the case 
of 2048 PU's is 368.2 GFLOPS, which is 59.9% 
of the theoretical peak speed. 
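
The percentages quoted in this section can be recomputed from the sustained and peak speeds. The sketch below is plain arithmetic on the figures given in the text; the helper `percent_of_peak` is a name of our own, and the LINPACK peak of 614.4 GFLOPS is taken as 2048 PU's at 300 MFLOPS each.

```python
# Efficiency figures quoted in this section, recomputed (illustrative).
PEAK_MFLOPS_PER_PU = 300.0


def percent_of_peak(sustained, peak):
    """Sustained speed as a percentage of the peak speed."""
    return 100.0 * sustained / peak


core = percent_of_peak(191, PEAK_MFLOPS_PER_PU)        # assembler core of the solver
with_comm = percent_of_peak(148, PEAK_MFLOPS_PER_PU)   # including communication
linpack = percent_of_peak(368.2, 614.4)                # 2048 PU's, in GFLOPS

print(f"solver core        : {core:.1f} % of peak")
print(f"solver, sustained  : {with_comm:.1f} % of peak")
print(f"LINPACK, 2048 PU's : {linpack:.1f} % of peak")
```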

Table 4 
Performance of LINPACK benchmark 

No. of    Rmax       Nmax      N1/2      Rpeak      Rmax/Rpeak 
Procs.    (GFLOPS)   (order)   (order)   (GFLOPS)   (%) 
1         0.1969     2340      360       0.3        65.6 
2         0.3873     3240      600       0.6        64.5 
4         0.7704     4680      960       1.2        64.2 
8         1.527      6480      1440      2.4        63.6 
16        3.022      9360      2160      4.8        62.95 
32        6.022      12960     3360      9.6        62.7 
64        12.0       18720     4800      19.2       62.5 
128       23.9       25920     6720      38.4       62.2 
256       46.81      37440     9600      76.8       61.0 
512       93.99      51840     15360     153.6      61.2 
1024      186.5      74880     21120     307.2      60.7 
2048      368.2      103680    30720     614.4      59.9 

6. Conclusions 

We have been able to develop a massively 
parallel computer with a peak speed of 614 GFLOPS 
through a very effective collaboration of computer 
scientists, physicists and a vendor. Throughout 
the development phase we held a joint meeting 
at least once a month, discussing every aspect 
of the CP-PACS computer from the architectural 
design to the details of the hardware 
implementation. 

The CP-PACS computer achieves a high 
performance of 40 - 50 % of the peak speed for lattice 
QCD application programs. The machine is very 
stable, and we are obtaining interesting results on 
the hadron spectrum in quenched QCD as well 
as in full QCD. Preparations for astrophysics and 
condensed matter physics applications have also 
started. 